Abstract. The number of completely sequenced chloroplast genomes
increases rapidly every day, making it possible to build large-scale
phylogenetic trees of plant species. Considering a subset of close plant
species defined according to their chloroplasts, the phylogenetic tree that
can be inferred from their core genes is not necessarily well supported,
due to the possible occurrence of “problematic” genes (i.e., homoplasy,
incomplete lineage sorting, horizontal gene transfers, etc.) which may
blur the phylogenetic signal. However, a trustworthy phylogenetic tree can
still be obtained if the number of problematic genes is low, the problem
being to determine the largest subset of core genes that produces the best
supported tree. To discard problematic genes, and because of the overwhelming
number of possible combinations, we propose a hybrid approach that
embeds both genetic algorithms and statistical tests. Given a set of
organisms, the result is a multi-stage pipeline for the production
of well-supported phylogenetic trees. The proposal has been applied to
different plant families, leading to encouraging results.

1 Introduction

The multiplication of complete chloroplast genomes should normally lead to
the ability to infer trustworthy phylogenetic trees for plant species. Indeed, the
existence of trustworthy coding sequence prediction and annotation software
specific to chloroplasts (like DOGMA [1]), together with the good control of
sequence alignment and of maximum likelihood or Bayesian inference phylogenetic
reconstruction techniques, should imply that, given a set of close species, their
core genome (the set of genes in common) can be as large and as accurately detected
as possible, finally yielding a well-supported phylogenetic tree. However, not all
genes of the core genome are constrained in the same way: some genes have a
larger ability to evolve than others, due to their lower importance.
Such minority genes tell their own story instead of that of the species, thereby
blurring the phylogenetic information.


To obtain well-supported phylogenetic trees, these problematic genes (which
may result from homoplasy, stochastic errors, undetected paralogy,
incomplete lineage sorting, horizontal gene transfers, or even hybridization)
must be removed. A solution is to construct the phylogenetic trees that correspond
to all the combinations of core genes, and to finally consider the tree that is as
supported as possible while considering as many genes as possible. The major
drawback is its prohibitive computational cost, since testing all the possible
combinations is totally intractable in practice ($2^n$ phylogenetic tree
reconstructions with $n \approx 100$ core genes for plants belonging to the same
order). Thus we have to remove the problematic genes without exhaustively testing
combinations of genes. Therefore, our proposal is to mix various approaches to
extract promising subsets of core genes, encompassing systematic deletion of genes,
random selection of large subsets, statistical evaluation of gene effects, and
genetic algorithms (GAs) [2,3]. The latter are efficient, robust, and adaptive
search techniques designed for solving optimization problems, which have the
ability to produce near-optimal solutions.


The contribution of this article can be summarized as follows. We focus on
situations where a large number of genes are shared by a set of species so that, in
theory, enough data are available to produce a well-supported phylogenetic tree.
However, a few genes tell a different evolutionary story than the majority of
sequences, leading to phylogenetic noise that blurs the phylogeny reconstruction.
The pipeline that we propose attempts to solve this issue by first computing all
phylogenetic trees which can be obtained by removing at most one core gene.
When such a preliminary systematic approach does not resolve the phylogeny,
new investigation stages are added to the pipeline, namely a Monte-Carlo based
random approach and two invocations of a genetic algorithm, separated by a
Lasso test. The pipeline is finally tested on various sets of chloroplast genomes.


The remainder of this article is organized as follows. We start in Section 2 by
giving a brief and global description of the problem. The initialization of the
genetic population is discussed in Section 3, while the first optimization stage
with the genetic algorithm is fully detailed in Section 4. Targeting problematic
genes using a Lasso test, and the following second invocation of the genetic
algorithm, are detailed in Section 5. Then, in Section 6, various plant families
are tested as a case study. Finally, this research work ends with a conclusion
section in which the contributions are summarized and intended future work is
outlined.

2 Presentation of the problem

Let us consider a set of chloroplast genomes that have been annotated using
DOGMA [1] (http://dogma.ccbb.utexas.edu/). We then have access to the
core genome [7] (genes present everywhere) of these species, whose size is about
one hundred genes when the species are close enough. For further information
on how we found the core genome, see [8]. Sequences are then aligned with
MUSCLE [9] and the RAxML [10] tool infers the corresponding phylogenetic
tree. If this resulting tree is well supported, then the process stops without
further investigation. Indeed, if all bootstrap values are larger than 95, then we
can reasonably consider that the phylogeny of these species is resolved, as the
largest possible number of genes has led to a very well-supported tree.

When some branches are not well supported, we can wonder whether
a few genes are responsible for this lack of support. If so, we face an
optimization problem: find the most supported tree using the largest subset of core
genes. Obviously, a brute force approach investigating all possible combinations
of genes is intractable, as it leads to $2^n$ phylogenetic tree inferences for a core
genome of size $n$. To solve this optimization problem, we propose a hybrid
approach mixing a genetic algorithm with statistical tests for
discovering problematic genes. The initial population for the genetic algorithm is
built by both systematic and random pre-GA investigations. These considerations
led to the pipeline detailed in Figure 1, whose stages are developed hereafter.

Fig. 1. Overview of the proposed pipeline for phylogenies based on chloroplasts.


3 Generation of the initial population 

The objective is to obtain a well-supported phylogenetic tree by using the largest
possible subset of genes. If this goal cannot be reached by taking all core genes, the
first thing to check is whether one particular gene is responsible for
this problem. Therefore we systematically compute all the trees we can obtain by
removing exactly one gene from the core genome, leading to $n$ new phylogenetic
trees, where $n$ is the core size (see Figure 2).

Fig. 2. Binary mapping operation and genetic algorithm overview. (a) Initial individuals
obtained in the systematic stage (systematic mode). Two kinds of individuals are
generated: first, one considering all genes in the core genome; second, those omitting
one gene in turn, one per position of the core genome. (b) Initial individuals generated
in the random stage (random mode) by randomly omitting 2-10 genes.


If, during this systematic approach, one well-supported tree is obtained, then
it is returned as the phylogeny of the species under consideration. Conversely, if
all trees obtained have at least one problematic branch, then deeper investigations
are required. However, the systematic approach, which is preliminary to the GA,
has reached its limits: investigating all phylogenetic trees that can be obtained
by removing 2 genes from a core genome of size $n$ leads to
$\binom{n}{2} = \frac{n(n-1)}{2}$ tree inferences. Obviously, the number of cases
explodes, and it is illusory to hope to investigate all the trees reachable by
discarding 10% of a core genome having 100 genes. This is why a genetic algorithm
has been proposed.
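
To make this explosion concrete, the short Python sketch below counts the gene subsets reachable by removing a given number of genes. The function names are ours and purely illustrative; only the binomial counting itself comes from the text.

```python
from math import comb


def n_subsets_removing(n: int, k: int) -> int:
    """Number of subsets obtained by removing exactly k of the n core genes."""
    return comb(n, k)


def n_subsets_removing_up_to(n: int, k_max: int) -> int:
    """Number of subsets obtained by removing between 0 and k_max genes."""
    return sum(comb(n, k) for k in range(k_max + 1))


if __name__ == "__main__":
    n = 100  # approximate core genome size used in the text
    for k in (1, 2, 5, 10):
        print(f"removing exactly {k:2d} genes: {n_subsets_removing(n, k)} trees")
    # The systematic stage itself only needs n + 1 inferences.
    print("systematic stage:", n_subsets_removing_up_to(n, 1), "trees")
```

With $n = 100$, removing two genes already requires 4950 inferences, and removing ten genes more than $10^{13}$, which is out of reach when each inference is a full bootstrap analysis.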

Using the $n + 1$ computed trees to initialize the population of the genetic
algorithm results in a population which remains too small and too homogeneous.
Indeed, all these trees have been computed in the same way, each inference being
produced using at least 99% of the core genome (we have removed at most 1 gene
from a core genome having approximately 100 genes). Thus, in order to increase the
diversity of the initial population, a second stage (the random stage shown in
Figure 2), which extracts large random subsets of the core genome for inference,
is applied.

More precisely, there are two random stages. The first one runs
for 200 iterations: at each iteration, an integer $k$ between 2 and 10 is first
randomly picked. Then $k$ genes are randomly removed from the core genome,
and a phylogeny is inferred using the remaining genes. If, during these iterations,
a very well-supported tree is obtained by chance, a stop signal is sent to the
master process and the obtained tree is returned. If not, we now have enough
data to build a relevant initial population for the genetic algorithm. The
second random stage is included in this genetic algorithm itself.
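
The first random stage can be sketched as follows. This is a minimal, sequential illustration and not the parallel implementation of the pipeline: the callable `lowest_bootstrap` is a hypothetical stand-in for the alignment and tree inference of a gene subset, and the stopping threshold of 95 is an assumption taken from the stopping criterion of Section 2.

```python
import random


def random_word(n, k_min=2, k_max=10, rng=random):
    """Binary word of length n with k randomly located 0's, k in [k_min, k_max]."""
    k = rng.randint(k_min, k_max)          # number of genes to remove
    removed = rng.sample(range(n), k)      # k distinct gene positions
    word = [1] * n
    for i in removed:
        word[i] = 0
    return word


def random_stage(n, lowest_bootstrap, iterations=200, threshold=95, rng=random):
    """Evaluate random subsets; stop early if a well-supported tree appears.

    lowest_bootstrap: hypothetical callable mapping a word to the lowest
    bootstrap of the tree inferred from the selected genes.
    Returns (winning word or None, list of (word, bootstrap) pairs).
    """
    evaluated = []
    for _ in range(iterations):
        word = random_word(n, rng=rng)
        b = lowest_bootstrap(word)
        evaluated.append((word, b))
        if b >= threshold:                 # "stop signal": tree is returned
            return word, evaluated
    return None, evaluated
```

When no word reaches the threshold, the list of evaluated words is what feeds the initial population of the genetic algorithm.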


4 Genetic algorithm 


A genetic algorithm is a well-known metaheuristic which has been described
by a rich body of literature since its introduction in the mid-seventies.
In the following, we only discuss the choices we made regarding operators
and parameters; further information on genetic algorithms and their
applications can be found in that literature.




4.1 Genotype and fitness value 

Genes of the core genome are supposed to be ordered lexicographically. Each
subset $s$ of the core genome thus corresponds to a unique binary word $w$ of
length $n$: for each $i$ from 1 to $n$, $w_i$ is 1 if the $i$-th core gene is in
$s$, and $w_i$ is 0 otherwise. To each binary word $w$ of length $n$ we can
associate its percentage $p$ of 1's and the lowest bootstrap $b$ of the
phylogenetic tree obtained when considering the subset of genes associated with
$w$. The fitness value of $w$ is then the score $b + p$, which must be as large as
possible. We currently consider that the bootstrap $b$ and the proportion of genes
$p$ have the same importance in the scoring function. However, changing the weight
of each parameter may be interesting in deeper investigations.
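
The encoding and the score can be illustrated by the following sketch. The helper names and the toy gene names in the usage note are illustrative only; the lowest bootstrap is taken as an input, since obtaining it requires a tree inference that is outside the scope of this example.

```python
def encode(core_genes, subset):
    """Binary word of a gene subset, genes being ordered lexicographically."""
    ordered = sorted(core_genes)
    chosen = set(subset)
    return [1 if gene in chosen else 0 for gene in ordered]


def fitness(word, lowest_bootstrap):
    """Score b + p: lowest bootstrap plus percentage of 1's in the word."""
    p = 100.0 * sum(word) / len(word)
    return lowest_bootstrap + p
```

For instance, with the core genome {atpA, psbA, rbcL} and the subset {atpA, psbA}, the word is 110; if the tree inferred from a word with 75% of 1's has a lowest bootstrap of 80, its score is 155.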


4.2 Genetic process 

Until now, the binary words (genotypes) of length $n$ that have been investigated are:

1. the word having only 1's (systematic mode);

2. all words having exactly one 0 (systematic mode);

3. 200 words having between 2 and 10 0's randomly located (random mode).


To each of these words is attached a score, which is used to select the 50 best
words, or fittest individuals, in order to build the initial population. After that,
the genetic algorithm loops for 200 iterations or until an offspring word such
that $b \geq 95$ is obtained. During an iteration, the algorithm applies the
following steps to produce a new population $P'$ from a population $P$ (see Figure 3).


1. Repeat 5 times: randomly pick a pair of words and mix them using a
crossover approach. The obtained words are added to the population $P$, as
described in Section 4.3, resulting in population $P_c$.

2. Mutate 5 words of the population $P_c$, the mutated words being added to
$P_c$ too, as detailed in Section 4.4, leading to population $P_m$.

3. Add 5 new random binary words having less than 10% of 0's (see Section 4.5)
to $P_m$, producing population $P_r$.

4. Select the 50 best words in population $P_r$ to form the new population $P'$.
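
The four steps above can be sketched as a single function. This is only an outline under stated assumptions: the operators `crossover`, `mutate`, and `random_word`, as well as the `score` callable (which hides the tree inference), are passed in as parameters and are hypothetical placeholders; individuals are represented as (word, score) pairs.

```python
import random


def next_population(population, score, crossover, mutate, random_word,
                    rng=random, n_ops=5, size=50):
    """One iteration P -> P' of the genetic process.

    population: list of (word, score) pairs.
    score, crossover, mutate, random_word: caller-supplied callables
    (assumptions of this sketch, not part of the original pipeline code).
    """
    pool = list(population)

    # Step 1: crossover of 5 random pairs taken from P, giving P_c.
    for _ in range(n_ops):
        (w1, _s1), (w2, _s2) = rng.sample(population, 2)
        child = crossover(w1, w2)
        pool.append((child, score(child)))

    # Step 2: mutation of 5 words of P_c, giving P_m.
    for word, _s in rng.sample(pool, n_ops):
        mutated = mutate(word)
        pool.append((mutated, score(mutated)))

    # Step 3: 5 new random words, giving P_r.
    for _ in range(n_ops):
        fresh = random_word()
        pool.append((fresh, score(fresh)))

    # Step 4: keep the 50 fittest words to form P'.
    pool.sort(key=lambda pair: pair[1], reverse=True)
    return pool[:size]
```

Each call triggers 15 new score evaluations, that is, 15 tree inferences per iteration, which is why the operators are kept deliberately cheap compared with the scoring.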


Let us now explain each step of this genetic algorithm in more detail.

Fig. 3. Outline of the genetic algorithm.


4.3 Crossover step 

Given two words $w^1$ and $w^2$, the idea of the crossover operation is to mix them,
hoping by doing so to generate a new word $w$ having a better score (see Figure 4a).
For instance, if we consider a one-point crossover located at the middle of the
words, then $w_i = w^1_i$ for $i < \frac{n}{2}$, while $w_i = w^2_i$ for
$i \geq \frac{n}{2}$: in that case, for the first 50% of core genes, the choice (to
take them or not for phylogenetic construction) in $w$ is the same as in $w^1$,
while the subset of considered genes in $w$ corresponds to that of $w^2$ for the
last 50% of core genes.

More precisely, at each crossover step, we first randomly pick an integer
$k$ below a fixed bound, which gives the number of crossover positions (see
$N_{crossover}$ in Figure 4), and then $k$ different integers $i_1, \dots, i_k$
such that $1 < i_1 < i_2 < \dots < i_k < n$. Then $w^1$ and $w^2$ are randomly
selected from the population $P$, and a new word $w$ is computed as follows:

- $w_i = w^1_i$ for $i = 1, \dots, i_1$,

- $w_i = w^2_i$ for $i = i_1 + 1, \dots, i_2$,

- $w_i = w^1_i$ for $i = i_2 + 1, \dots, i_3$,

- etc.

Then the phylogenetic tree based on the subset of core genes labeled by $w$ is
computed, the score $s$ of $w$ is deduced, and $w$ is added to the population with
the fitness value $s$ attached to it.
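
The alternation rule above can be written compactly as follows. This is a sketch: the cut positions follow the 1-based convention of the text, and the parameter `k_max`, which bounds the number of crossover positions, is an assumption of the example.

```python
import random


def multipoint_crossover(w1, w2, cut_points):
    """Alternate segments of w1 and w2 between successive cut positions.

    cut_points: increasing 1-based positions i_1 < ... < i_k; genes 1..i_1
    come from w1, genes i_1+1..i_2 from w2, and so on.
    """
    parents = (w1, w2)
    child = []
    start = 0
    for segment, end in enumerate(list(cut_points) + [len(w1)]):
        child.extend(parents[segment % 2][start:end])
        start = end
    return child


def random_crossover(w1, w2, k_max, rng=random):
    """Crossover with a random number k <= k_max of random cut positions.

    k_max is a hypothetical bound; it must not exceed len(w1) - 2.
    """
    n = len(w1)
    k = rng.randint(1, k_max)
    cuts = sorted(rng.sample(range(2, n), k))   # ensures 1 < i_1 < ... < i_k < n
    return multipoint_crossover(w1, w2, cuts)
```

With a single cut at the middle of the word, this reduces to the one-point crossover described at the beginning of this section.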

Fig. 4. (a) Two individuals are selected from the given population. The first portion of
the first individual, up to the chosen crossover position, is switched with the first
portion of the second individual. The number of crossover positions is determined by
$N_{crossover}$. (b) Random mutations are applied depending on the value of
$N_{mutation}$, randomly changing a gene state from 1 to 0 or vice versa. New offspring
generated at this stage are produced with respect to a natural evolution scenario.

Fig. 5. Random pair selections from the given population.


4.4 Mutation step

In this step, we ask whether slightly changing a given subset of genes, by removing
a few genes and adding a few others, may by chance improve the support
of the associated tree. In other words, we try here to improve the score of a
given word by replacing a few 0's by 1 and a few 1's by 0.

In practice, an integer $k$ below a fixed bound, corresponding to the number of
changes, or “mutations” (see $N_{mutation}$ in Figure 4), is randomly picked. Then
$k$ different integers $i_1, \dots, i_k$ between 1 and $n$ are randomly chosen and a
word $w$ is randomly extracted from the current population. A new word $w'$ is then
constructed as follows: for each $i = 1, \dots, n$,

- if $i \in \{i_1, \dots, i_k\}$, then $w'_i = w_i + 1 \bmod 2$ (the bit is mutated),

- else $w'_i = w_i$ (no modification).

Again, the phylogenetic tree corresponding to the subset of core genes associated
to $w'$ is computed, and $w'$ is added to the population together with its score.
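
A minimal sketch of this bit-flip rule is given below. The number of mutations `k` is passed explicitly, since its upper bound is a parameter of the pipeline; the function name is illustrative.

```python
import random


def mutate(word, k, rng=random):
    """Return a copy of word in which k distinct random positions are flipped."""
    positions = rng.sample(range(len(word)), k)   # the indices i_1, ..., i_k
    mutated = list(word)                          # the input word is left untouched
    for i in positions:
        mutated[i] = (mutated[i] + 1) % 2         # w'_i = w_i + 1 mod 2
    return mutated
```

Since the $k$ positions are distinct, the mutated word differs from the original one in exactly $k$ genes.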


AlKindy B. et al. 


4.5 Random step 


In this step, new words having a large number of 1's are added to the population.
Each new word is obtained by starting from the word having $n$ 1's, and then
changing $k$ randomly selected 1's to 0, where $k$ is an integer randomly
chosen between 1 and 10. The new word is added to the population after having
computed its score thanks to a phylogenetic tree inference.

5 Targeting problematic genes using statistical tests 

5.1 The Lasso test 

After having carried out 200 iterations of the genetic algorithm detailed above, it
may occur that no well-supported tree has been produced. Various reasons may
explain this failure, like a slow convergence speed, a large number of problematic
genes (e.g., homoplasic ones, or genes affected by stochastic errors, undetected
paralogy, incomplete lineage sorting, horizontal gene transfers, or hybridization),
or closely diverging species leading to very small branch lengths between two
internal nodes. However, we have now computed enough word scores to determine the
effect of each gene on topologies and bootstraps, and to remove the few genes
that break supports.

The idea is then to investigate each topology that has appeared often enough
during previous computations. In this study, we only consider topologies
having a frequency of occurrence larger than 10%. Note that this 10% threshold is
convenient for the given case study, but it should in fact depend on the number of
obtained topologies. Then for each best word of these best topologies, and for
each problematic bootstrap in its associated tree, we apply a Lasso approach as
follows.

The Lasso (Least Absolute Shrinkage and Selection Operator) test [15] is an
estimator that belongs to the category of least-squares regression analysis.
Like all the algorithms in this group, it estimates a linear model which minimizes
a residual sum, here controlled by a parameter $\lambda$. Let us explain how this
parameter can be used to order genes with respect to their ability to modify the
bootstrap support.

Let $X$ be an $m \times p$ matrix where each row $X_i = (X_{i1}, \dots, X_{ij}, \dots, X_{ip})$,
$1 \leq i \leq m$, is a configuration where $X_{ij}$ is 1 if gene number $j$ is present
in configuration $i$ and $X_{ij}$ is 0 otherwise. For each $X_i$, let $Y_i$ be the real
positive support value for each problematic bootstrap $b$ per topology and per gene.
According to [15], the Lasso test $\beta = (\beta_1, \dots, \beta_j, \dots, \beta_p)$ is
defined by

$$\beta = \arg\min_{\beta} \left\{ \sum_{i=1}^{m} \left( Y_i - \sum_{j=1}^{p} \beta_j X_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}. \qquad (1)$$

When $\lambda$ has a high value, all the $\beta_j$ are null. It is thus sufficient to
decrease the value of $\lambda$ to observe that some $\beta_j$ become non-null.
Moreover, the sign of $\beta_j$ is positive (resp. negative) if the bootstrap support
increases (resp. decreases) with respect to gene $j$.
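
The behavior of the coefficients when $\lambda$ decreases can be illustrated with a small solver. The sketch below minimizes the standard penalized least-squares Lasso objective (without intercept) by cyclic coordinate descent in pure Python; the choice of solver, the function names, and the toy data in the usage note are assumptions of this example, not a description of the software used in the pipeline.

```python
def soft_threshold(value, threshold):
    """Shrink value towards 0 by threshold (0 inside [-threshold, threshold])."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso(X, y, lam, sweeps=500):
    """Minimise sum_i (y_i - sum_j beta_j X_ij)^2 + lam * sum_j |beta_j|.

    X: list of m rows (gene configurations, entries 0 or 1), y: list of m
    support values, lam: penalty. Cyclic coordinate descent, no intercept.
    """
    m, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)                     # y - X beta, with beta = 0
    for _ in range(sweeps):
        for j in range(p):
            z = sum(X[i][j] ** 2 for i in range(m))
            if z == 0:
                continue                   # gene j absent everywhere
            # Correlation of column j with the residual computed without beta_j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j])
                      for i in range(m))
            new = soft_threshold(rho, lam / 2.0) / z
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(m):
                    residual[i] -= X[i][j] * delta
                beta[j] = new
    return beta
```

On the toy data `X = [[1, 0], [0, 1], [1, 1]]` and `y = [2, -3, -1]`, a large penalty gives two null coefficients, an intermediate one (`lam = 4`) makes only the second coefficient non-null, with a negative sign, and `lam = 0` recovers the exact least-squares solution. The order in which coefficients leave zero is what is used to rank genes.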


5.2 Second stage of genetic algorithm 

Targeting problematic genes using the Lasso approach can solve the issue of badly
supported values in some cases, especially when only one support is lower than the
predefined threshold. In cases where at least two branches are not well supported,
removing genes that break the first support may or may not have an effect on
the second problematic support. In other words, each of the two problematic
supports can be separately solved using Lasso investigations, but not necessarily
both together.

However, the population has been improved, receiving very interesting words
for each problematic branch. A last genetic algorithm phase is then launched
on the updated population, in order to mix these promising words by crossover
operations, hoping by doing so to resolve all of the badly supported
values at once. This last stage runs until either all problematic bootstraps are
resolved or the iteration limit (set to 1000 in our simulations) is reached.


6 Case studies 

6.1 Pipeline evaluation on various groups of plant species 

In this section, the proposed pipeline is tested on various sets of close plant
species. An example of 50 subgroups (ranging on average from 12 to 15 chloroplast
species) encompassing 356 plant species is presented in Table 1. The Stage column
contains the termination step for each subgroup, namely: the systematic (code
1), random (2), or optimization stages (3) using the genetic algorithm and/or Lasso
test. A large number of occurrences in this table means that the associated group
and/or subgroup had its computation terminated in either the penultimate or the
last pipeline stage. An occurrence of 31 is frequent due to the fact that 32 MPI
threads (one master plus 31 slaves) have been launched on our supercomputer
facilities. Notice that Table 1 is divided into four parts: groups of species stopped
in the systematic stage with weak bootstrap values (which is due to the fact that an
upper time limit has been set for each group and/or subgroup, while each computed
tree in these remarkable groups needed a lot of computation time), subgroups
terminated during the systematic stage with the desired bootstrap value, groups or
subgroups terminated in the random stage with the desired bootstrap value, and
finally, groups or subgroups terminated during optimization stages. The majority of
subgroups have their phylogeny satisfactorily resolved, as can be seen on all
obtained trees, which are downloadable at http://meso.univ-fcomte.fr/peg/phylo. In
what follows, an example of a problematic group, namely the Apiales, is more
deeply investigated as a case study.

Table 1. Families applied on pipeline stages

Group or subgroup        Occurrences  Core genes  # Species  Lowest bootstrap  Pipeline stage  Likelihood   Outgroup
Gossypium_group_0        85           84          12         26                1               -84187.03    Theo_cacao
Ericales                 674          84          9          67                3               -86819.86    Dauc_carota
Eucalyptus_group_1       83           82          12         48                1               -62898.18    Cory_gummifera
Caryophyllales           75           74          10         52                1               -145296.95   Goss_capitis-viridis
Brassicaceae_group_0     78           77          13         64                1               -101056.76   Cari_papaya
Orobanchaceae            26           25          7          69                1               -19365.69    Olea_maroccana
Eucalyptus_group_2       87           86          11         71                1               -72840.23    Stoc_quadrifida
Malpighiales             1183         78          12         80                3               -95077.52    Mill_pinnata
Pinaceae_group_0         76           75          6          80                1               -76813.22    Juni_virginiana
Pinus                    80           79          11         80                1               -69688.94    Pice_sitchensis
Bambusoideae             83           81          11         80                3               -60431.89    Oryz_nivara
Chlorophyta_group_0      231          24          8          81                3               -22983.83    Olea_europaea
Marchantiophyta          65           64          5          82                1               -117881.12   Pice_abies
Lamiales_group_0         78           77          8          83                1               -109528.47   Caps_annuum
Rosales                  81           80          10         88                1               -108449.4    Glyc_soja
Eucalyptus_group_0       2254         85          11         90                3               -57607.06    Allo_ternata
Prasinophyceae           39           43          4          97                1               -66458.26    Oltm_viridis
Asparagales              32           73          11         98                1               -88067.37    Acor_americanus
Magnoliidae_group_0      326          79          4          98                3               -85319.31    Sacc_SP80-3280
Gossypium_group_1        66           83          11         98                1               -81027.85    Theo_cacao
Triticeae                40           80          10         98                1               -72822.71    Loli_perenne
Corymbia                 90           85          5          98                2               -65712.51    Euca_salmonophloia
Moniliformopses          60           59          13         100               1               -187044.23   Prax_clematidea
Magnoliophyta_group_0    31           81          7          100               1               -136306.99   Taxu_mairei
Liliopsida_group_0       31           73          7          100               1               -119953.04   Drim_granadensis
basal_Magnoliophyta      31           83          5          100               1               -117094.87   Ascl_nivea
Araucariales             31           89          5          100               1               -112285.58   Taxu_mairei
Araceae                  31           75          6          100               1               -110245.74   Arun_gigantea
Embryophyta_group_0      31           77          4          100               1               -106803.89   Stau_punctulatum
Cupressales              87           78          11         100               2               -101871.03   Podo_totara
Ranunculales             31           71          5          100               1               -100882.34   Cruc_wallichii
Saxifragales             31           84          4          100               1               -100376.12   Aral_undulata
Spermatophyta_group_0    31           79          4          100               1               -94718.95    Mars_crenata
Proteales                31           85          4          100               1               -92357.77    Trig_doichangensis
Poaceae_group_0          31           74          5          100               1               -89665.65    Typh_latifolia
Oleaceae                 36           82          6          100               1               -84357.82    Boea_hygrometrica
Arecaceae                31           79          4          100               1               -81649.52    Aegi_geniculata
PACMAD_clade             31           79          9          100               1               -80549.79    Bamb_emeiensis
eudicotyledons_group_0   31           73          4          100               1               -80237.7     Eryc_pusilla
Poeae                    31           80          4          100               1               -78164.34    Trit_aestivum
Trebouxiophyceae         31           41          7          100               1               -77826.4     Ostr_tauri
Myrtaceae_group_0        31           80          5          100               1               -76080.59    Oeno_glazioviana
Onagraceae               31           81          5          100               1               -75131.08    Euca_cloeziana
Geraniales               31           33          6          100               1               -73472.77    Ango_floribunda
Ehrhartoideae            31           81          5          100               1               -72192.88    Phyl_henonis
Picea                    31           85          4          100               1               -68947.4     Pinu_massoniana
Streptophyta_group_0     31           35          7          100               1               -68373.57    Oedo_cardiacum
Gnetidae                 31           53          5          100               1               -61403.83    Cusc_exaltata
Euglenozoa               29           26          4          100               3               -8889.56     Lath_sativus

6.2 Investigating Apiales order 

In our study, Apiales chloroplasts consist of two sets, as detailed in Table 2: two
species belong to the Apiaceae family set (namely Daucus carota and Anthriscus
cerefolium), while the remaining seven species are in the Araliaceae family set.
The latter are: Panax ginseng, Eleutherococcus senticosus, Aralia undulata,
Brassaiopsis hainla, Metapanax delavayi, Schefflera delavayi, and Kalopanax
septemlobus. Chloroplasts of Apiales are characterized by having highly conserved
gene content and order.













Hybrid Genetic Algorithm and Lasso Test for Inferring Phylogenetic Trees 




Table 2. Genomes information of Apiales

Organism name               Accession    Genome Id  Sequence length  Number of genes  Lineage
Daucus carota               NC_008325.1  114107112  155911 bp        138              Apiaceae
Anthriscus cerefolium       NC_015113.1  323149061  154719 bp        132              Apiaceae
Panax ginseng               NC_006290.1  52220789   156318 bp        132              Araliaceae
Eleutherococcus senticosus  NC_016430.1  359422122  156768 bp        134              Araliaceae
Aralia undulata             NC_022810.1  563940258  156333 bp        135              Araliaceae
Brassaiopsis hainla         NC_022811.1  558602891  156459 bp        134              Araliaceae
Metapanax delavayi          NC_022812.1  558602979  156343 bp        134              Araliaceae
Schefflera delavayi         NC_022813.1  558603067  156341 bp        134              Araliaceae
Kalopanax septemlobus       NC_022814.1  563940364  156413 bp        134              Araliaceae


Method to select best topologies. We define $T = [t_0, t_1, \dots]$ as the list of the
$m = 9{,}053$ trees obtained from the pipeline. By comparing each tree $t_i$ in $T$
with the other trees in $T$, a set of topologies is then numbered and defined as
$W = \{w_0, w_1, w_2, \dots\}$, where $w_i$ is topology number $i$. Let $f(x)$ be the
function on $W$ which gives the number of trees having $x$ for their topology.
We say that a given topology $w_i$ is selected as a best topology if and only if
$f(w_i) > lb$, where $lb$ is the lower bound threshold computed by the following
formula:

$$lb = \frac{\gamma \times m}{100},$$

where $\gamma$ is a constant value between 1 and 10 and $m$ is the size of $T$. Then
$w_i$ is stored as a best topology.

Fig. 6. Best trees of topologies 0, 11, and 2.

Practical results. In our case, $\gamma = 8$, meaning that we exclude as noise the
topologies representing less than 8% of the given trees. Three of the 43 identified
tree topologies are selected as the best topologies, with a number of occurrences
$f(x)$ above $lb = 724$, as shown in Table 3. In this table, topologies 0 and 11 are
delivered by the optimization stages when the desired bootstrap value is set to 96,
and topology 2 is obtained from the systematic stage when we increase the desired
bootstrap to 100. The best obtained phylogenetic trees from the selected topologies
are provided in Figure 6. In Table 3, Min.Bootstrap is higher than Avg.Bootstrap,
as the former represents the lowest bootstrap value of the best tree in the given
topology, while Avg.Bootstrap is the average lowest bootstrap over all trees
having this topology.


Table 3. Information regarding obtained topologies

Topology  Min.Bootstrap  Avg.Bootstrap  Occurrences (f(x))  Gene rate (%)
0         88             56             5422                64.7
11        96             76             2579                44.8
2         100            68             787                 99.1
8         72             50             89                  44.8
9         49             29             48                  35.3
14        61             48             31                  25
5         80             48             21                  34.5
20        63             53             11                  53.4
10        62             50             8                   68.1
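
The selection rule can be checked against the occurrence counts listed in Table 3 with the short sketch below. The function name is illustrative; the counts, $m = 9053$, and $\gamma = 8$ are those reported in this section, and the threshold is computed as $\gamma$ percent of $m$.

```python
def select_best_topologies(occurrences, m, gamma):
    """Topologies whose number of trees exceeds lb = gamma * m / 100."""
    lb = gamma * m / 100.0
    return [topology for topology, count in occurrences.items() if count > lb]


# Occurrences f(x) of the topologies listed in Table 3.
table3 = {0: 5422, 11: 2579, 2: 787, 8: 89, 9: 48, 14: 31, 5: 21, 20: 11, 10: 8}

print(select_best_topologies(table3, m=9053, gamma=8))  # -> [0, 11, 2]
```

Here $lb = 8 \times 9053 / 100 = 724.24$, so topology 2, with 787 occurrences, passes the threshold by a small margin, while topology 8 (89 occurrences) is far below it.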


As can be noted, only 3 of the 43 obtained topologies contain trees whose
lowest bootstrap is larger than 87, namely 0, 11, and 2. It is not so easy to make a
decision, since all selected trees are very close to each other, with small
differences. A new question needs to be answered: which genes are responsible for
changing the tree from topology 0 to topology 11 or to topology 2? Deep
investigations are needed in future work to answer this new question and to discover
the sets of genes in group a, group b, and group c that change one tree topology to
another (see Figure 6).

The only notable difference between topologies 0 and 11 is the position of the
taxa Kalo_septemlobus and Meta_delavayi. In the same way, there is only one
difference between topologies 0 and 11 on the one hand and topology 2 on the other:
the grouping of the same two taxa, Kalo_septemlobus and Meta_delavayi. Different
comparisons of the trees obtained with the selected topologies are summarized in
Figure 7.

Fig. 7. Different comparisons of the topologies w.r.t. the number of removed genes
(axis label: Omitted Genes), the number of disregarded genes being expressed
relative to $n$, the number of core genes. (a) Number of trees per topology,
(b) number of trees whose lowest bootstrap is larger than or equal to 80, (c) lowest
bootstrap in best trees, and (d) average of lowest bootstraps.


7 Conclusion 

In this study, a multi-stage pipeline has been applied (namely: systematic
mode, random mode, GA stage one, Lasso test mode, and GA stage two) for
inferring trustworthy phylogenetic trees from various plant groups. We have
verified that inferring a phylogenetic tree based on either the full set or some
subsets of common core genes does not always lead to good support of the
phylogenetic reconstruction. In both systematic and random stages, many trees
have been generated by omitting some genes, systematically or randomly. When the
desired score was not reached, a genetic algorithm has then been applied in two
specific stages using previously generated trees, to find new optimized solutions
through crossover and mutation operations. Furthermore, we applied a
Lasso test for identifying and systematically removing blurring genes, thereby
discarding those which have the worst impact on supports. We tested this pipeline
on 322 different plant groups, where 63 of them are base families while the
remaining ones are random trees, the latter playing the role of skeletons when
reconstructing the supertree. A case study regarding the Apiales order is analyzed,
and three “best” topologies stand out from the 43 obtained. Deep investigation will
be needed in future work, in order to discover which genes change the topology, and
to investigate the sequences of the genes that blur the signal, to find the reasons
for such effects.